CHAPTER 1 - INTRODUCTION


Advances in computer technology have made on-line full-text
retrieval systems an attractive alternative to the conventional
means of information retrieval by hand. However, since most such
systems allow searchers to use words from a non-controlled
vocabulary as search terms, the success of a search
depends on a correct choice of words. To be assured of a good
recall and/or precision ratio, a searcher has to include the
right words together with all their spelling variants, synonyms
and similar-meaning phrases, which is not a realistic demand.

Fortunately, the problem is alleviated to some extent by
the iterative nature of on-line retrieval systems and the
fact that most retrieval systems do allow some form of prefix,
infix, and/or suffix truncation for the search terms. But often
enough, a searcher can find his desired information only with
much difficulty and unnecessary delay. At other times, a
searcher cannot even converge a search, simply because he
cannot come up with enough, or any, alternate or more precise
words for his search expression.

We believe at least two tools can be used to circumvent
this predicament. One is to provide searchers with an on-line
thesaurus. The other is to provide a facility which enables
searchers to selectively scan over meaningful words from the
documents retrieved thus far. The purpose of both is to suggest to
searchers potential clues for search terms.

Various efforts have been made to generate thesauri, but not
much work has been done in the other area. While it is still
debatable  whether  any  practical  and  elaborate  thesaurus 
system  is  easily  implementable,  a  word-scanning  facility  is 
comparatively  simple  to  design  and  implement.  Used  wisely,  it 
can  be  a  very  effective  means  to  accelerate  and 
converge  searches.  The  main  design  consideration  with  such  a 
facility  is  a  clever  word  selection  and  screening  mechanism. 

This  thesis  describes  the  implementation  of  such  a  facility 
on  an  experimental  on-line  retrieval  system  -  Eureka,  and 
discusses  its  two  potential  uses.  Firstly,  it  can  be  used  as  an 
effective  search  aid  for  on-line  searchers  to  obtain  good, 
discriminating  search  terms  to  improve  their  search  strategy. 
Secondly,  it  can  be  used  towards  automatic  or  semi-automatic 
system  generation  of  thesaurus  classes  consisting  of  synonyms, 
spelling  variants,  similar  meaning  phrases  and  related  words. 

Chapter 2 describes our word scanning system, HISTOGRAM:
its purpose, facilities, language design, syntax and usages as
an on-line search aid. Chapter 3 describes the systems aspects
of HISTOGRAM. Chapter 4 discusses how HISTOGRAM can be used
towards automatic or semi-automatic generation of thesauri.
Chapter 5 concludes this discussion by suggesting possible areas
of further improvement and raising some open-ended questions for
future research.


For  the  purpose  of  this  discussion,  a   list   of   terms   and 
definitions  used  in  our  context  is  included  in  Appendix  A. 


CHAPTER 2 - HISTOGRAM... A WORD-SCANNING FACILITY


2.1  The  Need  for  an  Intelligent  Word  Scanning  Facility 

On many occasions during an information search, a user
would like to be able to look at the contents of the responding
documents. By looking at the text, a user can usually tell
whether he is going in the right direction in converging his
search. If not, he will save considerable time and effort
by changing his search strategy at an early stage. In addition,
being able to glance over the text of the responding documents
can make the user aware of other potentially good words to use
or add to his search expression, such as synonyms or
synonymous phrases that he has not considered or has simply
overlooked.

However,  since  the  number  of  initial  responding  documents 
is  usually  very  large,  often  on  the  order  of  tens,  or  even 
hundreds,  users  will  be  reluctant  to  look  at  the  actual  text  of 
the  document  set.  To  take,  as  an  example,  a  responding  document 
set  of  50  documents  with  an  average  size  of  1,000  words  each, 
the  document  set  would  then  contain  50,000  words.  Obviously, 
not  too  many  users  will  have  the  patience  to  go  through  these 
words.  Even  if  they  do,  it  is  doubtful  whether  any  significant 
information can be extracted, given the amount of built-in
noise, duplicate and irrelevant words present. For this
reason, an intelligent word scanning facility with a good word
screening and selection mechanism is needed.

HISTOGRAM, our version of a word scanning facility, is an
implementation that attempts to provide the above mentioned
capabilities.

2.2  HISTOGRAM  and  the  Zipf  Distribution 

Our design of the HISTOGRAM word screening and selection
mechanisms draws from the concept of the Zipf distribution
of words in natural English language documents. George
Kingsley Zipf, a linguist, made the interesting observation
that for any collection of natural English language
documents, the frequency of occurrence f of the n-th ranked type
is approximately equal to k/n. Here a type is a distinct word,
its rank is its standing when the types are listed in descending
order of frequency of occurrence, and k is the number
of occurrences of the most frequently occurring type.
Mathematically, this can be expressed as f = k/n.
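
As a rough illustration of the relation f = k/n, the following
Python sketch ranks the types of a small text and sets each
observed frequency beside the value k/n. The function name and the
sample text are hypothetical and chosen only so that the numbers
are easy to follow; real collections obey the relation only
approximately.

```python
from collections import Counter


def zipf_table(text):
    """Return (rank, type, observed frequency, k/rank) for each type in text."""
    tokens = text.lower().split()
    ranked = Counter(tokens).most_common()  # types in descending frequency
    k = ranked[0][1]  # occurrences of the most frequent type
    return [(n, word, f, k / n) for n, (word, f) in enumerate(ranked, start=1)]


# Hypothetical sample: 12 tokens, 4 types.
sample = "the act the court the act the state the court the act"
for rank, word, observed, predicted in zipf_table(sample):
    print(rank, word, observed, round(predicted, 2))
```

In this contrived sample the three highest ranking types match k/n
exactly, while the lowest ranking type occurs once against a
predicted 1.5.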

The  general  shape  of  a  Zipf  distribution  curve  is  shown  in 
Figure  2.1.  The  area  under  the  curve  approximates  the  number  of 
words  in  the  document  set.  The  non-linearity  of  the  curve 
suggests two things. The left end of the curve suggests that
for any collection of documents, a large percentage of tokens,
where a token is a single occurrence of a type, corresponds to
only a small percentage of types in the document set. These are
the highest ranking types in the set. The right tail of the
curve suggests that for any collection of documents, a large
number of types have a very low frequency of occurrence, or
token frequency, in the set. These are the lower ranking types
in the set. As words from these two groups are respectively too
general or too specific to be of much use as search terms, they
are really noise words and can be cut off from the group of
words to be displayed to the searcher.

2.3  The  HISTOGRAM  Screening  and  Selection  Mechanisms 

The   screening   mechanism   of   HISTOGRAM   consists    of    2 
concurrent  processes. 

The  first  process  is  merging,  which  is  a  reduction  of  words 
to  types.  An  upper  bound  average  reduction  ratio  of  150  can  be 
achieved  from  this  merging  process  for  the  State  Statutes 
data  bases.  (This  figure  is  arrived  at  from  the  fact  that  the 
data  base  has  an  average  token  to  type  ratio  of  150[6].  However, 
since  we  are  only  dealing  with  subsets  of  the  data  base  at 
any  given  time,  we  would  expect  the  ratio  to  be  much  lower.) 
While this reduction process does not actually eliminate
the words at the high-frequency end of the Zipf curve, it reduces
them to only a relatively small number of types.

Another screening process done concurrently with the
merging is the screening of the lower end Zipf curve words from
the documents to be merged. This effectively is a low frequency
cutoff from the merged document set and further reduces the
number of types by a significant percentage. Assuming an
upper bound of 10% for this percentage, the average reduction
ratio achieved by these combined processes alone would then be
as high as 167. To take our previous example of the 50-document
set, the initial HISTOGRAM screening process would then have
reduced the number of words to be displayed from 50,000 to as
low as 300 words!
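
The arithmetic behind these figures can be checked with a short
Python sketch. It assumes that the two reductions simply compound,
that is, merging divides the word count by the token to type ratio
and the low frequency cutoff then removes a fixed fraction of the
remaining types. The function name is hypothetical.

```python
def combined_reduction_ratio(token_to_type_ratio, cutoff_fraction):
    """Reduction ratio after merging and then dropping a fraction of the types."""
    return token_to_type_ratio / (1.0 - cutoff_fraction)


ratio = combined_reduction_ratio(150, 0.10)  # upper-bound figures from the text
words_left = 50_000 / ratio                  # the 50-document example
print(round(ratio), round(words_left))
```

With a ratio of 150 and a 10% cutoff the combined ratio rounds to
167, and 50,000 words reduce to about 300.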

After  the  initial  screening  process  which  reduces  the 
number  of  words  to  be  displayed  to  a  much  more  manageable  size, 
the  user  can  then  choose  to  manipulate  and  scan  these  words 
selectively  with  the  HISTOGRAM  word  selection  mechanism.  This 
mechanism  consists  of  a  statistics  table  and  a  windowing 
facility  through  which  he  can  view  the  words.  The  statistics 
table  displays  distribution  statistics  of  the  types  with  respect 
to  the  token  frequency  and  how  these  types  within  each  interval 
of  the  token  frequency  domain  are  distributed  among  the 
documents. By defining token frequency as the frequency of
occurrences of a distinct type, and document frequency as
the number of documents a particular type appears in, we
can interpret this statistics table as a 3-dimensional histogram
as in Figure 2.2. The histogram on the front is a
distribution histogram of the types within a domain of token
frequency. The histograms on and parallel to the left side are
the distribution histograms of the types within the document
frequency domain for the corresponding interval of token
frequency on the front plane.

The  histogram  on  the  front  plane  is  in  fact  another  way  of 
looking  at  the  shape  of  Zipf  curve  of  a  document  set.  This 
different  arrangement  is  more  revealing  and  appropriate  for  our 
purpose.  The  2  domains  of  this  3-dimensional  histogram,  the 
token  frequency  and  the  document  frequency,  are  user-definable. 
This  enables  the  searcher  to,  in  effect,  look  at  all  or  zoom 
into  any  part  of  the  Zipf  distribution  curve  with  the  additional 
capability  of  seeing  how  the  types  distribute  among  the 
documents  within  intervals  of  the  part  examined.  With  the  aid  of 
this  table,  the  user  is  then  able  to  determine  how  the 
types  are  distributed  and  which  groups  of  types  he  wants  to 
see.  Then,  with  the  very  same  parameters,  the  token  frequency 
and  the  document  frequency,  the  user  can  select  the  types  he 
wants  to  inspect. 

[FIGURE 2.2 - 3-dimensional histogram of types over the token
frequency and document frequency domains]

2.4 EUREKA and the HISTOGRAM Subsystem

HISTOGRAM is implemented as a subsystem in EUREKA, an
experimental on-line full text retrieval system designed at the
University  of  Illinois  by  a  research  group  under  Prof.  David  J. 
Kuck.  The  system  is  PDP-11  based,  running  under  a  multi-process, 
multi-user executive[3]. The current hardware configuration is
made up of a 64K PDP-11/40, 2 disk drives, 2 CRT's and 1
printer.  Presently,  the  data  bases  used  are  the  various  state 
statutes. 

The EUREKA query language is made up of 9 commands in 4
main functional areas as follows:

(1)  Finding an arbitrarily complex Boolean search expression
in a defined document set, with or without context
specification.

(2)  Printing selected portions of documents and information
about preceding queries.

(3)  Auxiliary functions such as defining and deleting
document sets, query sets, macros and comments.

(4)  Logon, logoff.

A thorough description of the EUREKA system and query
language can be found in [4].

2.5 Language Design and Description

To be in line with the language design of EUREKA, the
HISTOGRAM commands are simple, few, but effective. Apart from
the entry and exit commands, there are only 4 HISTOGRAM
commands. The syntax of the commands is straightforward and
minimal; no verbose, English-like commands or exhaustive
options exist to confuse the users.

In  the  description  of  the  HISTOGRAM  commands  that  follows, 
the  following  notations  for  syntax   specification   are  used:- 

(1)  Capital  letters  or  commas   are  the   characters   to   be 
typed  as  part  of  the  command. 

(2)  An   underscored   character   means   the   character    is 
mandatory  for  the  command. 

(3)  Words   in  small   letters   represent   parameters   whose 
values  are  to  be  supplied  by  the  user. 

(4)  A  set  of  parallel  entries  enclosed  by  2   vertical   bars 
means  any  but  only  one  of  the  entries  applies. 

2.5.1 HIST Command

Syntax:     HISTOGRAM

Function:   Entry from EUREKA to the HISTOGRAM subsystem.

2.5.2 MERGE Command

Syntax:     MERGE    | eureka document set name |
                     | LAST                     |
                     | [document number list]   |

- eureka document set name is any EUREKA defined
document set

- LAST is the last document set defined in EUREKA

- document number list is a user defined list of
document numbers in the data base, separated by
commas with no embedded blanks

Function:   Produces a resultant merged file of alphabetically
ordered types and statistics from the document set.
The completion of the merge process is signalled by
the system prompt character '.'.

A maximum of only 1 resultant merged file exists for
a user at any given time. Any subsequent merge
will erase the previous resultant merged file. Any
merge command that requests a non-existent
document set or a non-existent document will be
ignored and no error message will be issued.
In that case the previously saved resultant merged file,
if any exists, will not be scratched. Error
messages will be issued for the other invalid
requests.

Examples:   ME [1,7,23,15]   (Merges documents 1, 7, 23 and 15.)
            ME LAST          (Merges the last EUREKA
                             defined document set.)

2.5.3  STATS  Command 

Syntax:     STATS [ltf,utf,ldf,udf]

- ltf is the user specified lower bound for the token
frequency

- utf is the user specified upper bound for the token
frequency

- ldf is the user specified lower bound for the
document frequency

- udf is the user specified upper bound for the
document frequency

Default:    ltf - 1

utf - 65535

ldf - 1

udf - the total number of documents in the data base

Substitution is implied by the absence of an operand or
operands between commas, or by a premature ']'. No
embedded blanks are allowed.

Function:  The  STATS  command  displays  a  distribution  statistics 
table  of  all  the  types  in  the  merged  file  with  the 
user-specified  domains  of  token  and  document 
frequencies.  The  table  is  an  8*8  table  with  intervals 
scaled  on  the  two  domains.  By  wisely  varying  the 
upper  and  lower  bounds  of  these  parameters,  the  user 
has  in  effect  a  zoomable  viewing  device  on  the 
distribution  statistics  of  the  types. 

Command  ignored  with  invalid  domains  or 
non-existent   merged  file. 

Display:  Figure  2.3  is  an  example  of  the  statistics 
distribution  table  displayed  by  STATS.  Token 
frequency,  whose  domain  is  bounded  by  the 
user-specified  values  or  through  default,  is  scaled 
into  8  equal  integral  intervals  with  the  last 
interval  possibly  extended  or  truncated  to  the  upper 
bound. The document frequency domain is similarly
scaled. Each entry under the column labelled 'TYPES'
shows a count of the types occurring in the token
frequency interval specified in the rightmost column
of each row. The 'TOTAL' displayed at the bottom of
the 'TYPES' column shows the total number of types in
the specified token frequency domain. The entries in
the columns between the leftmost and rightmost ones
show the occurrence frequencies of types in the
specific document frequency interval.

Example:  In response to the command STATS [1,380,1,15],
HISTOGRAM will produce a display similar to the one
in Figure 2.3.

[FIGURE 2.3 - STATS display for STATS [1,380,1,15].
TOKEN FREQ scale: 1, 50, 100, 150, 200, 250, 300, 350, 380.
DOC FREQ scale: 1, 2, 4, 6, 8, 10, 12, 14, 15.
Column of type counts headed TYPES.
NO. OF TYPES IN THE MERGED FILE: 404.]

The display is a statistics distribution table for
types in the merged file with token frequency from
and including 1 to 380 and document frequency from
and including 1 to 15.

As an example, row 5 in the table would mean the
following: There are 70 types altogether that occur
between 151 and 200 times inclusive. Among these
types, 20 occur in 1 to 2 documents, 10 in 3 to 4
documents, 11 in 5 to 6 documents, 9 in 7 to 8
documents, 3 in 9 to 10 documents, 7 in 11 to 12
documents, 4 in 13 to 14 documents and 6 in 15
documents, summing up to 70.
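
The scaling of a domain into intervals can be sketched in Python as
follows. This is only an illustration under stated assumptions: the
text does not say how the interval width is chosen, so the default
below (the smallest integral width that covers the domain in 8
slots) is a guess, and the function name and its width parameter
are hypothetical. The token frequency intervals of Figure 2.3 have
a width of 50, which is passed explicitly.

```python
import math


def scale_domain(lower, upper, slots=8, width=None):
    """Split [lower, upper] into at most `slots` integral intervals.

    The last interval is truncated or extended so that it ends at `upper`.
    """
    if width is None:
        # Assumed rule: smallest integral width that covers the domain.
        width = max(1, math.ceil((upper - lower + 1) / slots))
    intervals = []
    start = lower
    while start <= upper and len(intervals) < slots:
        end = start + width - 1
        if len(intervals) == slots - 1 or end > upper:
            end = upper  # last interval absorbs or drops the remainder
        intervals.append((start, end))
        start = end + 1
    return intervals


print(scale_domain(1, 380, width=50))  # token frequency domain of the example
print(scale_domain(1, 15))             # document frequency domain of the example
```

For the document frequency domain 1 to 15 the sketch yields seven
intervals of width 2 and a final interval holding 15 alone, which
agrees with the row described above. A domain smaller than 8 values
yields fewer than 8 intervals.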

2.5.4 LIST Command

Syntax:     LIST [ltf,utf,ldf,udf]

- ltf,utf,ldf,udf as in 2.5.3

Default:    as in 2.5.3

Function:   Lists the types in the merged documents that satisfy
the user requested thresholds on frequencies.


Display:    Selected types are displayed four in a row
alphabetically.

Examples:   (1)  L [5,10,4,4]  (Lists types that occur from 5 to 10
times in the documents and in exactly 4
documents.)

(2)  L [,200]  (Lists types that occur from 1 to 200
times, in any number of documents of the
data base.)
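
The default substitution and the threshold selection described for
STATS and LIST can be sketched in Python as below. This is not the
HISTOGRAM code: the function names, the dictionary that stands in
for the merged file, and the sample types other than JUNK are all
hypothetical.

```python
MAX_TOKEN_FREQ = 65535  # default utf given in 2.5.3


def parse_bounds(operand, total_docs):
    """Parse '[ltf,utf,ldf,udf]', substituting defaults for absent operands."""
    if not (operand.startswith("[") and operand.endswith("]")) or " " in operand:
        raise ValueError("operand must look like [ltf,utf,ldf,udf], no blanks")
    fields = operand[1:-1].split(",")
    if len(fields) > 4:
        raise ValueError("too many operands")
    values = [1, MAX_TOKEN_FREQ, 1, total_docs]
    for i, field in enumerate(fields):
        if field:  # an absent operand keeps its default
            values[i] = int(field)
    return tuple(values)


def list_types(merged, operand, total_docs, per_row=4):
    """Return display rows of the types that fall within the bounds."""
    ltf, utf, ldf, udf = parse_bounds(operand, total_docs)
    selected = sorted(
        text for text, (tf, df) in merged.items()
        if ltf <= tf <= utf and ldf <= df <= udf
    )
    return [selected[i:i + per_row] for i in range(0, len(selected), per_row)]


# Hypothetical merged file: type -> (token frequency, document frequency)
merged = {"JUNK": (132, 4), "COURT": (8, 4), "TAX": (6, 4),
          "ACT": (300, 15), "LIEN": (7, 2)}
print(list_types(merged, "[5,10,4,4]", total_docs=15))
print(list_types(merged, "[,200]", total_docs=15))
```

A premature ']' simply leaves the trailing bounds at their
defaults, so [,200] becomes the bounds 1, 200, 1 and the number of
documents in the data base.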


2.5.5  SEARCH  Command 

Syntax:     SEARCH  '  character  string  ' 

Function:  The  search  command  initiates  a  search  through  the 
merged  file  for  the  type  specified  in  the  operand.  If 
a  hit  is  made,  the  type's  token  and  document 
frequencies  in  the  documents  merged  previously  will 
be  displayed  along  with  the  text  of  the  type.  If  not 
found,  a  message  to  that  effect  will  be  generated. 


Examples:   (1)  SE 'JUNK'  (Searches for the type 'JUNK'.)

JUNK [132,4]  (The type 'JUNK' occurs 132 times
in 4 documents in the document set.)

(2)  SEARCH 'BIGFOOT'

BIGFOOT    NOT FOUND


2.5.6  EXIT  Command 

Syntax:     | EX |
            | ED |

Function:   Exit from HISTOGRAM to EUREKA.

EX is an exit with the merged file saved for
subsequent use.

ED is an exit with the merged file scratched.

2.6  Usages  of  HISTOGRAM  as  a  Search  Aid 

There  are  many  ways  HISTOGRAM  can  be  used  in  assisting  a 
searcher  to  converge  his  search. 

By using the SEARCH command, a searcher can evaluate the
usefulness of the terms in his search expression. Words that
have relatively high token and document frequencies are probably
too general to be good search terms. Likewise, words with
relatively low token and document frequencies are probably too
specific to be of any use. On the other hand, words with a
significant token frequency and a not too high document
frequency are words that appear many times in a small number
of documents and are potentially good terms to converge a
search.

By using the STATS command, a user can feel out the general
characteristics of the document set before he sets out to
selectively look at the words. From the statistical table, one
can get an idea of how the documents in the document set relate
to each other. If a significant number of the medium token
frequency words appear in a large percentage of the documents,
the documents within the set are probably closely related to
some common topics. On the other hand, if a significant number
of medium token frequency words appear with a small document
frequency, the document set is probably made up of clusters of
related documents. However, when the medium token frequency
types are quite evenly distributed, it could mean that we
have a set of unrelated documents on our hands and should
probably re-evaluate our search strategy.

As  a  matter  of  fact,  there  are  many  ways  that  the 
statistics  can  be  interpreted  and  used  to  help  a  searcher.  Once 
the  user  has  an  idea  of  what  scope  of  words  he  wants  to  see,  he 
can  select  them  for  inspection  with  the  proper  parameters  in  the 
LIST  command. 

Though generally words with medium token frequencies are
our main focus of attention, it is certainly conceivable that
other words could be useful on other occasions. To summarize,
what we have cited are only a few instances of how we think
HISTOGRAM could be used and are definitely not exhaustive.
Different goals and stages in our search might call for different
approaches to using HISTOGRAM. Hopefully, with practice and
experience, users can find new ways of interpreting and using
HISTOGRAM to improve their search strategy.


CHAPTER 3 - HISTOGRAM DESIGN AND DESCRIPTION


3.1  Design  Philosophy 

HISTOGRAM was designed as an autonomous subsystem of EUREKA
with a minimal number of interface paths. As it is experimental
in nature, one of our main design concerns was to make the
code straightforward and easy to modify and maintain. To this
end, the technique of modular programming was used
extensively. The HISTOGRAM subsystem is broken down into
command processing modules, and each command processing module
is in turn further divided into functional units of
smaller programming entities in the form of macros and
subroutines. Every function that might be changed,
modified, replaced or tuned in the future is programmed as an
independent unit so that any change can be done with minimal
effort and disturbance.

While this extensive modularizing tends to impose some
additional CPU overhead, this should not affect EUREKA and
HISTOGRAM performance, as HISTOGRAM itself is an I/O bound
procedure and EUREKA, from measurements done by Milner[3], is
only using 50% of its CPU cycles anyway.

3.2 Functional Overview

HISTOGRAM is made up of 5 modules. Except for the
dispatching module, HSTGRM, which communicates with the various
command modules, there exists no direct interaction among
command modules.

HISTOGRAM  itself  only  communicates  with  EUREKA  in  three 
situations  -  entry,  exit  and  the  passing  of  the  EUREKA  user 
logon  block  pointer  in  order  to  obtain  a  EUREKA  defined  document 
set  list. 

Briefly, the 5 HISTOGRAM modules perform the
following:

(1)  HSTGRM  -  Interfacing with the EUREKA user, the EUREKA
system and the various HISTOGRAM command
modules.

(2)  MGR     -  Merging documents: the process of reducing
duplicate words to distinct words and summing
up their token and document frequencies while
sorting them in alphabetical order.

(3)  STATS   -  Building and displaying, according to
user-defined domains, a statistical table for
the merged document set.

(4)  LISTER  -  Displaying words of the merged document set
according to user-specified parameters.

(5)  SRCHER  -  Showing distribution statistics of any
specific word within the merged document set.

[FIG 3.1 - HISTOGRAM FUNCTIONAL OVERVIEW. Boxes: EUREKA, HSTGRM,
MGR, STATS, LISTER, SRCHER, Statute Files, Merged File. Legend:
control flow, data flow.]

Given any user-requested document set, HISTOGRAM inputs the
already-existing statistics files corresponding to each of
these documents. Through the merger, it then merges them and
at the same time strips the intermediate files generated
during the merge and the resulting merged file into a simpler
format. The resulting merged file is then input and acted upon
by the other HISTOGRAM modules whenever required.

3.3  Files  of  HISTOGRAM 

The input statistics files and the output merged file of
HISTOGRAM are made up of entries, each containing a
variable-length token text and tagged-along statistical data.
These entries are ordered alphabetically. Entries in the merged
file and the intermediate scratch files have the same format. Each
entry contains a token text preceded by a 1-byte text length
count, followed on an even boundary by a double-word token
frequency count and a one-word document frequency count. The
input statistical files have essentially the same format
without the document frequency count but with some other
irrelevant data tagged on after the frequency count (Fig. 3.2).
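
The merged-file entry layout just described can be sketched in
Python with the struct module. Several details are assumptions made
for illustration: a word is taken as 16 bits (as on the PDP-11),
both counts are stored little-endian, the pad byte is zero, and the
text is plain ASCII. The function names are hypothetical.

```python
import struct


def pack_entry(text, token_freq, doc_freq):
    """Encode one merged-file entry as bytes."""
    raw = text.encode("ascii")
    if len(raw) > 255:
        raise ValueError("text too long for a 1-byte length count")
    head = bytes([len(raw)]) + raw
    if len(head) % 2:
        head += b"\x00"  # pad so the counts start on an even boundary
    # '<IH': 32-bit token frequency, 16-bit document frequency, no padding
    return head + struct.pack("<IH", token_freq, doc_freq)


def unpack_entry(buf, offset=0):
    """Decode the entry at an even `offset`.

    Returns (text, token_freq, doc_freq, next_offset).
    """
    length = buf[offset]
    text = buf[offset + 1:offset + 1 + length].decode("ascii")
    pos = offset + 1 + length
    pos += pos % 2  # skip the pad byte, if any
    token_freq, doc_freq = struct.unpack_from("<IH", buf, pos)
    return text, token_freq, doc_freq, pos + 6


buf = pack_entry("JUNK", 132, 4) + pack_entry("TAX", 6, 4)
print(unpack_entry(buf))
print(unpack_entry(buf, 12))
```

Because every entry has an even length under these assumptions,
each entry in a file starts on an even offset and the reader can
step from one entry to the next with the returned offset.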

The  input  statistical  files  are  really  read-only  files 
while  the  merged  file  can  be  saved,  scratched  or  over-written 
with  another  merge  operation.  The  intermediate  files  generated 
are  work  files  and  are  deleted  after  every  2-way  merge. 

3.4  Description  of  the  Command  Modules 

3.4.1  HSTGRM 

HSTGRM, the interfacing module between the user, EUREKA and
the HISTOGRAM modules, is the command interpreter and control
dispatcher of the subsystem. It is entered from EUREKA and
re-entered from the HISTOGRAM command modules when they are
finished with the requested command. On user request to exit, it
jumps to a section of code that handles the exit option of
keeping or deleting the merged file.

The parameters that it passes to the command modules are
the EUREKA logon block pointer and the command buffer pointer.
No condition flags or data are passed back when control is
returned from the command modules.

3.4.2 MGR

The  merging  module,  MGR,  consists  of  a  driver,  a  2-way 
merger,  an  input  handler  and  an  output  handler. 

The 2-way merger is a very specialized routine. It merges
by pulling entries off and placing the results at pre-designated
locations, letting the input and output handlers worry about
the actual I/O and the tedious bookkeeping involved. In merging
the entries, it takes the smaller of the two if the tokens
are unequal. If they are equal, it adds the statistics, puts
only one entry out and discards the other.
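
The merging rule can be sketched in Python as follows. The sketch
works on in-memory lists of (text, token frequency, document
frequency) tuples instead of the fixed locations and buffers of the
real routine, and the function name and sample entries are
hypothetical.

```python
def two_way_merge(left, right):
    """Merge two alphabetically ordered lists of (text, token_freq, doc_freq)."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        a, b = left[i], right[j]
        if a[0] < b[0]:        # unequal: take the smaller entry
            merged.append(a)
            i += 1
        elif b[0] < a[0]:
            merged.append(b)
            j += 1
        else:                  # equal: add the statistics, emit one entry
            merged.append((a[0], a[1] + b[1], a[2] + b[2]))
            i += 1
            j += 1
    merged.extend(left[i:])    # one input is exhausted; copy the rest
    merged.extend(right[j:])
    return merged


doc1 = [("ACT", 3, 1), ("COURT", 2, 1)]
doc2 = [("COURT", 5, 1), ("TAX", 1, 1)]
print(two_way_merge(doc1, doc2))
```

Adding the document frequency counts as well as the token frequency
counts is what lets the merged entry for COURT record that the type
appears in 2 documents.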


The input handler performs the actual I/O in retrieving the
entries from the disk files, does buffer management, updates
pointers to the current entries in the buffers, moves entries
from the buffers to the appropriate fixed locations, does low
frequency word cutoffs in the first pass, and checks for and acts
accordingly on I/O error conditions and end-of-file conditions.

The output handler is comparatively simple. It moves
entries from the fixed location to the output buffer, updates
the pointer and writes the buffer out when it is full.


The driver itself is the heart of the MGR module. It emits
file names to the input handler for every 2-way merge and
sequences the overall merging activities at the document level.

Initially,  it  builds  the  list  of  file  names  to  be  merged  by 
interpreting  the  EUREKA  passed  document  set  list  or  parsing 
the  user-defined  document  set  list  depending  on  how  the  merge  is 
requested.  It  then  picks  out  these  names  2  at  a  time  and 
emits  them  to  the  input  handler  for  the  2-way  merge.  It  also 
generates  file  names  for  all  the  intermediate  scratch  files.  The 
scratch  files  are  deleted  as  soon  as  they  are  entirely  absorbed 
by  the  2-way  merger.  Only  the  final  merged  file  is  not  deleted 
automatically. 

As  for  the  sequencing  function,  in  addition  to  passing 
file  names  for  the  merge,  the  driver  is  also  able  to  recognize 
the  various  situations  of  whether  it  is  the  first  pass  of  the 
merge  or  not  and  whether  the  number  of  documents  in  that  pass  is 
even  or  odd.  Different  combinations  of  these  call  for  different 
actions  and  settings  of  flags  in  order  to  optimize  control  flow 
and/or  simplify  coding. 
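
One way to picture the sequencing is the Python sketch below, which
merges in-memory lists pass by pass. It rests on assumptions that
the text does not confirm: that an odd file in a pass is carried
over unmerged to the next pass, and that the first-pass low
frequency cutoff is a simple threshold on the per-document token
frequency. The function names and the min_token_freq parameter are
hypothetical, and the standard library's heapq.merge stands in for
the 2-way merger.

```python
from heapq import merge as sorted_merge
from itertools import groupby


def two_way_merge(left, right):
    """Combine two ordered entry lists, summing the counts of equal types."""
    out = []
    for text, group in groupby(sorted_merge(left, right), key=lambda e: e[0]):
        rows = list(group)
        out.append((text, sum(r[1] for r in rows), sum(r[2] for r in rows)))
    return out


def merge_documents(files, min_token_freq=1):
    """Merge per-document (text, token_freq) lists pairwise, pass by pass."""
    # First pass only: apply the cutoff and start each document frequency at 1.
    current = [[(t, tf, 1) for t, tf in f if tf >= min_token_freq] for f in files]
    while len(current) > 1:
        next_pass = []
        for k in range(0, len(current) - 1, 2):   # take the files 2 at a time
            next_pass.append(two_way_merge(current[k], current[k + 1]))
        if len(current) % 2:                      # odd count: carry the last one
            next_pass.append(current[-1])
        current = next_pass
    return current[0] if current else []


d1 = [("ACT", 3), ("COURT", 2)]
d2 = [("COURT", 5), ("TAX", 1)]
d3 = [("ACT", 1), ("LIEN", 4)]
print(merge_documents([d1, d2, d3]))
print(merge_documents([d1, d2, d3], min_token_freq=2))
```

With three documents the first pass merges d1 with d2 and carries
d3 over, and the second pass produces the final merged list.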

3.4.3  STATS 

The STATS command module is responsible for building the
statistical table on the merged file for the user-requested
domains.

When passed control by HSTGRM, the subsystem dispatcher,
STATS checks whether the user-specified parameters are valid.
These parameters are the user requested upper and lower bounds
of the domains of token and document frequencies. They are
considered valid if they are numbers and the upper bound is
greater than the lower bound. It then builds an internal
statistical table equivalent to the one to be displayed. The
domains are scaled into 8 equal integral intervals whenever
possible, the exceptions being the cases when the domain itself
is not large enough to be split into 8 integral intervals and/or
the last interval is extended or truncated to the upper bound of
that domain.

Each slot in this table, which is usually 8 by 8, then
corresponds to a particular pair of document frequency and token
frequency intervals. The slots are used as counters and are
initialized to zero.
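
The validation and scaling described above might look like the following sketch. It is an illustration under stated assumptions, not the STATS code: the name `build_intervals` is hypothetical, bounds are taken as inclusive integers, and the interval width is rounded down so that the last interval is extended (rounding up instead would truncate it).

```python
def build_intervals(lower, upper, slots=8):
    """Split the inclusive integer domain [lower, upper] into at most `slots` intervals."""
    if not (isinstance(lower, int) and isinstance(upper, int)):
        raise ValueError("bounds must be numbers")
    if upper <= lower:
        raise ValueError("upper bound must be greater than lower bound")
    span = upper - lower + 1
    width = max(1, span // slots)  # assumption: width rounded down, minimum 1
    intervals = []
    start = lower
    while start <= upper and len(intervals) < slots:
        intervals.append((start, start + width - 1))
        start += width
    # The last interval always ends exactly at the upper bound.
    intervals[-1] = (intervals[-1][0], upper)
    return intervals


token_intervals = build_intervals(1, 83)  # 8 intervals, last one extended to 83
doc_intervals = build_intervals(1, 5)     # domain too small: only 5 intervals
# Counters, one per (document interval, token interval) pair, start at zero.
table = [[0] * len(token_intervals) for _ in doc_intervals]
print(token_intervals)
print(doc_intervals)
```

The second call shows the exceptional case of a domain that cannot be split into 8 integral intervals, so the table here is 5 by 8 rather than 8 by 8.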

STATS  then  builds  a  2-level  directory  that  maps  entries 
into  the  appropriate  slots  to  increment  the  counter.  When  this 
is  done,  the  STATS  module  simply  walks  through  entries  in  the 
merged  file  with  a  relatively  simple  input  handler  and  updates 
the  appropriate  counters  whenever  tokens  within  the  requested 
domains  are  found. 
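
As a rough illustration of the counting step, the sketch below treats the 2-level directory as two lookup tables, one from document frequency to table row and one from token frequency to table column. That reading of "2-level" is an assumption, as are the names `make_directory` and `tally` and the entry layout (token, token frequency, document frequency).

```python
def make_directory(intervals):
    """Map every integer value inside the intervals to its slot index."""
    return {value: slot
            for slot, (low, high) in enumerate(intervals)
            for value in range(low, high + 1)}


def tally(entries, doc_intervals, token_intervals):
    """Count entries (token, token_freq, doc_freq) into a 2-D table of slots."""
    doc_dir = make_directory(doc_intervals)      # level 1: document frequency -> row
    token_dir = make_directory(token_intervals)  # level 2: token frequency -> column
    table = [[0] * len(token_intervals) for _ in doc_intervals]
    for _token, token_freq, doc_freq in entries:
        row = doc_dir.get(doc_freq)
        col = token_dir.get(token_freq)
        if row is not None and col is not None:  # skip entries outside the domains
            table[row][col] += 1
    return table


entries = [("abortion", 3, 2), ("birth", 12, 4), ("control", 11, 4), ("the", 900, 5)]
table = tally(entries, doc_intervals=[(1, 2), (3, 5)], token_intervals=[(1, 10), (11, 20)])
print(table)  # [[1, 0], [0, 2]]
```

The entry for 'the' falls outside the requested token frequency domain and is therefore not counted.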

On end of file, STATS proceeds to format the internal table
into a displayable form by doing the necessary editing and
conversions. It then displays the table to the user.

The  statistics  table  that  it  builds  is  not  retained  and 
every  new  STATS  command  causes  another  round  of  table  building. 
The  merged  file  however  remains  intact  and  is  only  overwritten 
after  another  MERGE  command. 

3.4.4  LISTER 

LISTER enables users to look at words in the merged documents
that fall within their specified threshold values of document
and token frequencies. These threshold values do not have to
coincide with the intervals displayed in the statistical table.

Again,  this  module  simply  walks  through  the  entries  in  the 
merged  file  and  picks  the  ones  that  satisfy  the  thresholds.  It 
then  moves  these  tokens  to  the  terminal  buffer,  edits  them  and 
outputs  them  whenever  the  output  buffer  is  full. 
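
The walk-through and buffered output can be sketched as below. The entry layout (token, token frequency, document frequency), the name `list_tokens` and the buffer size are assumptions made for illustration; the editing of tokens for display is omitted.

```python
def list_tokens(entries, token_range, doc_range, buffer_size=4):
    """Yield buffers of tokens whose frequencies fall inside both threshold ranges."""
    buffer = []
    for token, token_freq, doc_freq in entries:
        if (token_range[0] <= token_freq <= token_range[1]
                and doc_range[0] <= doc_freq <= doc_range[1]):
            buffer.append(token)
            if len(buffer) == buffer_size:
                yield buffer  # output buffer is full
                buffer = []
    if buffer:
        yield buffer  # flush the partially filled buffer at end of file


entries = [("abortion", 3, 2), ("birth", 12, 4), ("control", 11, 4),
           ("pregnancy", 7, 3), ("prevention", 5, 2), ("the", 900, 5)]
buffers = list(list_tokens(entries, token_range=(3, 12), doc_range=(2, 4), buffer_size=2))
for line in buffers:
    print(" ".join(line))
```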

3.4.5  SRCHER 

SRCHER does essentially the same walk-through as LISTER, except
that it is looking for just one particular entry with the
specified token. When a hit is made, SRCHER converts the
tagged-on token and document frequencies into character form and
displays the token and its statistics. When end of file or an
alphabetically larger token is reached, SRCHER gives up and
informs the user.

At present, SRCHER can only handle one token per search request.
But with additional code to parse multiple entries in the search
command operand and sort these tokens, the module would be able
to accept multiple tokens per request without any further
modification.
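
The multiple-token extension suggested above can be sketched as a single pass over the sorted merged file. This is a hypothetical illustration, not SRCHER itself: the name `search_tokens` and the entry layout (token, token frequency, document frequency) are assumptions, and the file is assumed to be sorted in the same collating order that Python uses to compare strings.

```python
def search_tokens(entries, wanted):
    """Look up several tokens in one pass over entries sorted by token.

    Returns {token: (token_freq, doc_freq)} with None for tokens not found.
    """
    results = {}
    pending = sorted(set(wanted))
    position = 0
    for token, token_freq, doc_freq in entries:
        # Give up on every wanted token that sorts before the current entry.
        while position < len(pending) and pending[position] < token:
            results[pending[position]] = None
            position += 1
        if position == len(pending):
            break  # nothing left to look for
        if pending[position] == token:
            results[token] = (token_freq, doc_freq)
            position += 1
    for token in pending[position:]:
        results[token] = None  # end of file reached without a hit
    return results


entries = [("abortion", 3, 2), ("birth", 12, 4), ("control", 11, 4), ("pregnancy", 7, 3)]
found = search_tokens(entries, ["control", "apple", "zebra"])
print(found)
```

Sorting the requested tokens first is what lets one forward walk serve them all: each token is abandoned as soon as an alphabetically larger entry is seen, exactly as in the single-token case.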

CHAPTER  4  -  THESAURUS  GENERATION


4.1  Conventional  Methods  of  Thesaurus  Generation

The generation of on-line thesauri for full-text retrieval
systems has received considerable attention in recent years.
Thesauri can be generated manually, semi-automatically or
automatically. Manual thesaurus generation is a tedious task
which involves subjective judgement and is inflexible to changes
in the data base. Automatic and semi-automatic thesaurus
generation, on the other hand, are more objective, faster and
more flexible to changes in the data base, provided reasonably
good generation algorithms are used. One problem with automatic
and semi-automatic generation, though, is that regardless of the
generation algorithm used, a tremendous amount of computing
resources in terms of processor storage, secondary storage and
computing time is required. Simply put, no generation algorithm
that is both simple and effective yet exists.

4.2  Concept  of  Thesaurus  Classes

Before we proceed any further, we need to clarify our notion of
thesaurus classes. We believe thesaurus classes should contain
not only synonyms, but also words that are related without being
synonymous in any sense of the word. Words that appear in the
same phrases or words that are always mentioned together should
also be considered and included in our thesaurus classes. As an
example, consider the following group of words: 'birth',
'control', 'prevention', 'abortion' and 'pregnancy'. While
certainly not all of these words would be included in a
conventional thesaurus group, they all pertain to a common
topic. For a searcher who is looking for information on birth
control, it would be helpful to know of these other possible
words he can use as search terms. We believe that as an on-line
search aid, this extended concept of thesaurus groups, covering
synonyms as well as related words, would serve a better purpose,
and we should therefore aim at such thesaurus classes in
generating thesauri.


4.3  Our  Proposed  Approach  to  Thesaurus  Generation

As  far  as  the  generation  process  is  concerned,  we  outline 
here  a  2-step  approach. 

The first step involves clustering of documents in the data
base. The idea behind clustering is that documents in a data
base can be broken up into smaller clusters of documents.
Documents within a cluster are loosely related to some common
topics. While a document can belong to more than one of these
clusters, for all practical purposes clusters are independent
entities with no inter-relationship to each other whatsoever. As
an example, a document on birth control obviously belongs to a
different cluster from a document on highway safety
regulations.

Generation  of  a  thesaurus  then  should  be  from  these 
subsets  of  the  data  base  rather  than  from  the  entire  data  base 
itself.  There  are  two  advantages.  Obviously,  by  reducing  the 
number  of  documents  to  a  manageable  size,  automatic  or 
semi-automatic  generation  of  a  thesaurus  should  be  a  lot 
easier.  In  addition,  generating  a  thesaurus  from  documents
within  a  cluster  filters  out  noise  that  would  otherwise  be
introduced  by  unrelated  documents  outside  of  the  cluster.
As  an  example,  the  words  'law'  and  'order'  would  probably  be 
put  into  the  same  thesaurus  group  for  a  cluster  of  documents 
on  law  and  order  while  the  words  'order'  and  'magnitude'  would 
probably  be  put  into  another  thesaurus  group  in  a  cluster  of 
documents  on  computer  performance,  but  the  words  'law'  and 
'magnitude'  certainly  do  not  belong  together. 

4.3.1  Clustering  Documents 

To cluster documents, we can make use of the concept of
document vectors, where a document vector is a list of distinct
words which characterizes a document and distinguishes it from
the others. Looking at the Zipf curve of any particular
document, we know that words at either end of the curve are
noise words of very little significance. Intuitively, as we move
from both ends toward the inside of the curve, we would find
more and more significant words. If some kind of significance
level curve could be plotted onto the Zipf curve, it would
probably resemble the one in Figure 4.1. The idea of defining a
document vector, then, is to define low cutoff and high cutoff
threshold values of token frequency so that we obtain a set of
words large enough to characterize a document but small enough
for later manipulation in the clustering and thesaurus
generation processes. Since the threshold values are absolute
quantities which vary with document size, any general algorithm
for determining them has to be based on relative quantities such
as distribution percentiles.
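
One way a percentile-based cutoff could work is sketched below. It is an assumption for illustration, not an algorithm from this work: the cutoffs are expressed as percentages of the types in rank order, the name `document_vector` and the default percentages are hypothetical, and types tied in frequency at a cutoff are split alphabetically.

```python
def document_vector(token_freqs, low_percent=40, high_percent=20):
    """Keep the types lying between the high and low cutoffs of one document.

    token_freqs: dict mapping type -> token frequency.
    high_percent: share of types dropped from the frequent end of the ranking.
    low_percent: share of types dropped from the rare end of the ranking.
    """
    # Rank 1 is the most frequent type; ties are broken alphabetically.
    ranked = sorted(token_freqs, key=lambda t: (-token_freqs[t], t))
    drop_high = len(ranked) * high_percent // 100
    drop_low = len(ranked) * low_percent // 100
    kept = ranked[drop_high:len(ranked) - drop_low]
    return sorted(kept)


freqs = {"the": 120, "of": 80, "birth": 9, "control": 8, "pregnancy": 5,
         "clinic": 3, "whereas": 1, "hereto": 1, "aforesaid": 1, "viz": 1}
print(document_vector(freqs))  # ['birth', 'clinic', 'control', 'pregnancy']
```

Because the cutoffs are relative, the same percentages can be applied to documents of very different sizes, which is the point of using percentiles rather than absolute token frequencies.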

4.3.2  Generating  Thesaurus  Classes  from  Clusters 

Once we have built clusters and thereby reduced the number of
documents and the number of words to a size practical for
thesaurus generation, many of the existing statistical
correlation techniques for generating a thesaurus would become
practical and feasible to implement. One such technique is the
generation of thesaurus classes by a co-occurrence matrix. The
co-occurrence matrix is a table of 'd' rows and 't' columns,
where 'd' is the number of documents in the data base concerned
and 't' the number of types in the data base. Each column in the
table corresponds to a type and each row corresponds to a
document in the data base. A '1' in the i-th row, j-th column
entry indicates the presence of the j-th column type in the i-th
row document, and a '0' indicates the absence of the type from
the document. The matrix is then used as a tool to determine
which types co-occur often enough in the same documents to be
considered within the same thesaurus classes.
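
The matrix just described can be built and queried as in the sketch below. The function names and the `min_docs` threshold are hypothetical: the text says only that types must co-occur "often enough", so counting shared documents against a fixed minimum is an assumption standing in for whatever correlation measure is finally chosen.

```python
from itertools import combinations


def cooccurrence_matrix(doc_vectors):
    """Build the d-by-t 0/1 matrix: rows are documents, columns are types."""
    types = sorted(set().union(*doc_vectors))
    matrix = [[1 if t in set(vec) else 0 for t in types] for vec in doc_vectors]
    return types, matrix


def cooccurring_pairs(types, matrix, min_docs=2):
    """Return pairs of types that appear together in at least `min_docs` documents."""
    pairs = {}
    for a, b in combinations(range(len(types)), 2):
        together = sum(row[a] and row[b] for row in matrix)
        if together >= min_docs:
            pairs[(types[a], types[b])] = together
    return pairs


docs = [["birth", "control", "pregnancy"],
        ["birth", "control", "abortion"],
        ["highway", "safety"],
        ["birth", "pregnancy"]]
types, matrix = cooccurrence_matrix(docs)
pairs = cooccurring_pairs(types, matrix)
print(types)
print(matrix)
print(pairs)
```

Comparing every pair of columns costs on the order of t squared times d operations, which is why reducing both the number of documents and the number of types beforehand matters.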

4.4  HISTOGRAM  Modules  and  Thesaurus  Building  in  EUREKA 

The above section gives only an overall outline of an approach
to thesaurus generation. A detailed plan for the implementation
requires further investigation and effort. However, at this
point it is clear that many of the existing HISTOGRAM modules
can be put to use regardless of the final plan adopted.

The merging module of HISTOGRAM, for example, can be used to
strip the statistics files into a format acceptable to the STATS
module. The STATS module can then compile distribution
statistics. These statistics can be used as a criterion by some
general algorithm to generate document-dependent high and low
cutoff values. Once these values are determined, the LISTER
module can use them to extract groups of words as document
vectors and output them to intermediate files for later use.
With document vectors defined, we can employ statistical
techniques such as correlation of document vectors to cluster
documents.
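
As one possible form of such a correlation, the sketch below scores two document vectors by the share of types they have in common (the Jaccard coefficient) and groups documents that are linked by a score at or above a threshold. Both the measure and the threshold are assumptions chosen for illustration; no particular technique has been settled on here. The sketch also places each document in exactly one cluster, which is a simplification of the notion that a document may belong to several.

```python
def jaccard(vec_a, vec_b):
    """Overlap of two document vectors: shared types divided by all types."""
    a, b = set(vec_a), set(vec_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(doc_vectors, threshold=0.3):
    """Group documents whose vectors are linked by similarity >= threshold."""
    names = list(doc_vectors)
    parent = {name: name for name in names}

    def find(name):
        # Follow links up to the representative document of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if jaccard(doc_vectors[first], doc_vectors[second]) >= threshold:
                parent[find(second)] = find(first)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


vectors = {"D1": ["birth", "control", "pregnancy"],
           "D2": ["birth", "control", "abortion"],
           "D3": ["highway", "safety", "speed"],
           "D4": ["highway", "safety", "licence"]}
clusters = cluster(vectors)
print(clusters)  # [['D1', 'D2'], ['D3', 'D4']]
```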

All of the above-mentioned functions involving HISTOGRAM
modules need only minor modifications to the modules concerned.
One convenient characteristic of the EUREKA data bases of state
statutes is that they are already arranged in a natural order of
clusters. We can therefore use these naturally arranged clusters
as yardsticks in evaluating our clustering techniques. By
experimenting with different cutoff values to vectorize
documents and different correlation techniques for clustering,
we hope to arrive at satisfactory algorithms for generating
clusters that are small.

Once we have good ways of producing small clusters, we can try
out various correlation techniques with our "co-occurrence
matrix" method of thesaurus generation. Again through
experimentation, we hope to come up with a good scheme for
generating a thesaurus.

At such a time, we can then use the existing thesaurus feature
in EUREKA to actually build the thesaurus system[5]. It has yet
to be investigated how feasible it is, and how much effort would
be involved, to interface the HISTOGRAM modules with this
thesaurus feature so that thesaurus classes can be generated and
the thesaurus system built completely automatically. However, at
the very least, our generated thesaurus classes can be entered
into EUREKA manually to produce an on-line thesaurus.

CHAPTER  5  -  CONCLUSION


We have described a facility that attempts to improve users'
performance on on-line searches. The facility is experimental in
nature. Our initial objective was to get the facility
operational and then do whatever tuning and modification proved
necessary afterwards. We believe that ways to improve and modify
the system, in both performance and user interfacing, will
become clear as more data and user feedback become available
once the system is operational.

As can be seen now, there are several areas for major
improvements.

The major performance problem with the current HISTOGRAM system
is the I/O bottleneck. This bottleneck is mainly caused by the
large number of arm movements necessary for every 2-way merge.
Two things can be done to directly minimize and optimize arm
movements. Placing the statistics files and intermediate scratch
files close together and/or close to the center of the disk
would minimize arm movement time. Also, an arm movement
optimizing scheme for choosing the input and output files for
merging would greatly reduce unnecessary arm movement delays. As
a matter of fact, the latter is to a certain extent implicit in
the current merging algorithm. If a user requests a merge for a
EUREKA-defined document set, the document number list passed by
EUREKA to the HISTOGRAM merger is already arranged in ascending
order of document number. Thus, by picking documents 2 at a time
from the start of the list, the merger picks the closest
possible pair of documents for every merge.

Significant performance improvement can also be achieved by
reducing the volume of I/O for the merge. This can be done by
shrinking the size of the statistics files and the intermediate
scratch files. One way to do so would be to eliminate the
irrelevant statistical information in the original statistics
files. This information occupies a significant percentage of the
file space and is not useful for our purpose. As the first pass
merge I/O constitutes more than half the total volume of I/O,
reducing the size of these initial statistics files would
significantly improve system performance. Another way worth
considering is to encode the variable length token text in every
entry into a 2-byte code. This would reduce the size of the
statistics files and the intermediate scratch files by a
substantial percentage. As a matter of fact, by encoding words
with an implicit alphabetical ordering, comparison of tokens can
be reduced to a trivial operation and the merging process
greatly simplified. The actual decoding of the words does not
have to be done until the point where the user selects the words
to see, and even then, the number of words to be decoded would
generally be only on the order of tens or hundreds.
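
The encoding idea can be illustrated with the sketch below, which is not part of the current system. It assumes that the full vocabulary is known before codes are assigned and that it holds at most 65,536 distinct words, so that every code fits in 2 bytes; the name `build_code_table` is hypothetical.

```python
def build_code_table(vocabulary):
    """Assign each distinct word a 2-byte code in alphabetical order."""
    words = sorted(set(vocabulary))
    if len(words) > 0x10000:
        raise ValueError("vocabulary too large for 2-byte codes")
    encode = {word: code for code, word in enumerate(words)}
    return encode, words  # words[code] decodes a code back into its word


encode, decode = build_code_table(["pregnancy", "birth", "control", "abortion"])

# A fixed-length 2-byte key replaces the variable length token text.
packed = encode["control"].to_bytes(2, "big")

# Comparing codes gives the same result as comparing the words themselves.
same_order = (encode["birth"] < encode["control"]) == ("birth" < "control")

# Decoding is needed only for the few words the user asks to see.
word = decode[int.from_bytes(packed, "big")]
print(packed, same_order, word)
```

Because the codes are assigned in sorted order, the merge can compare two fixed-length integers instead of two strings. The price is that a word added to the vocabulary later would require codes to be reassigned to preserve the ordering.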

The merging process is another potential area of improvement.
We have a tentative low cutoff mechanism for every merge in the
first pass, but we have yet to introduce a high cutoff scheme in
this pass. The reason we have not done so is the difficulty of
determining a threshold value for the high cutoff. To a much
greater extent, the high cutoff threshold would vary with the
size of the documents. It is not obvious at this point what
method could be used to filter out the high frequency types
without imposing so much overhead that it would defeat the whole
purpose of a high cutoff. Introducing a stop list of the usual
high frequency words like 'the', 'a', etc. would help but does
not seem to be very general or flexible.

The merging algorithm can be improved also. At present, merging
of the statistics files for the documents is done by 2-way
merges, and an intermediate file is generated for each merge.
The generation of files hampers system performance, as the
opening and closing of files are slow processes. One possible
way to avoid having to open and close files so frequently is to
use a single intermediate file for all the merges within the
same pass rather than 1 intermediate file per 2-way merge. This
would, however, require a much more complicated bookkeeping
algorithm than the current one, and restructuring of some of the
system modules concerned might be necessary.

One final point that deserves attention concerns the
vectorizing of documents. Our objective is to use as few words
as possible to characterize a document's contents. A mere high
and low cutoff as discussed earlier might not be sufficient for
our purpose. Some additional screening using weighting factors
based on a type's token and document frequencies might help.
This information is readily available from HISTOGRAM.

As  mentioned  earlier,  HISTOGRAM  is  meant  to  be  a  dynamic 
system.  It  is  hoped  that  through  experimentation  with  this 
facility,  solutions  to  some  of  the  problems  encountered  in 
on-line  retrieval  systems  can  be  found. 

APPENDIX  A  -  GLOSSARY  OF  TERMS

recall  ratio:  the  ratio  of  retrieved  relevant  documents  to  the 
total  number  of  relevant  documents. 

precision  ratio:  the  ratio  of  retrieved  relevant  documents  to 
the  number  of  retrieved  documents  in  the  data  base. 

token: any unbroken string of alphanumeric characters.

type: a distinct token.

token frequency: the number of times a token occurs.

document frequency: the number of documents a type appears in.

type rank: the standing of a type in a document set according
to its token frequency. The type with the lowest rank number
always has the highest token frequency in the set.

query:  a  search  request. 

query  language:  the  repertoire  of  commands  in  a  retrieval 
system. 

search  term:  a  word  used  in  a  search  request. 

search  expression:  a  syntactically  correct  combination  of  the 
desired  contents  of  documents  to  be  retrieved  from 
the  data  base. 

responding  documents:  the  retrieved  documents  in  a  search. 

document  set:  a  collection  of  documents. 

thesaurus  classes:  groups  of  related  words,  not  necessarily 
synonymous. 
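
Several of the definitions above can be made concrete with a short worked example. The sketch is illustrative only: the function names are hypothetical, folding text to lower case before counting is an assumption, and both the retrieved and the relevant sets are assumed to be non-empty.

```python
import re
from collections import Counter


def recall_precision(retrieved, relevant):
    """Return (recall ratio, precision ratio) for one search."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved relevant documents
    return hits / len(relevant), hits / len(retrieved)


def frequencies(documents):
    """Return (token frequency, document frequency) counters for a document set."""
    token_freq, doc_freq = Counter(), Counter()
    for text in documents:
        # A token is any unbroken string of alphanumeric characters.
        tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
        token_freq.update(tokens)     # every occurrence counts
        doc_freq.update(set(tokens))  # each type counts once per document
    return token_freq, doc_freq


recall, precision = recall_precision(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
token_freq, doc_freq = frequencies(["birth control", "birth rate and birth weight"])
print(recall, precision)                         # 0.4 0.5
print(token_freq["birth"], doc_freq["birth"])    # 3 2
```

Here two of the five relevant documents were retrieved (recall 0.4) and two of the four retrieved documents were relevant (precision 0.5). The type 'birth' has a token frequency of 3 and a document frequency of 2.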